Problem Statement

This problem statement is based on the Shinkansen Bullet Train in Japan, and passengers’ experience with that mode of travel. The dataset contains a random sample of individuals who traveled on this train. The on-time performance of the trains along with passenger information is in a file named ‘Traveldata_train.csv’. These passengers were later asked to provide their feedback on various parameters related to the travel along with their overall experience. These collected details are in the survey report labeled ‘Surveydata_train.csv’.

In the survey, each passenger was explicitly asked whether they were satisfied with their overall travel experience or not, and that is captured in the data of the survey report under the variable labeled ‘Overall_Experience’.

The objective of this problem is to understand which parameters play an important role in swaying passenger feedback toward the positive end of the scale, and to build the most accurate model for predicting passenger experience.

EDA and Data Preprocessing

We see that the data contains NaN values, so we will need to clean it. But first, let's explore the data further.

We see that:

  1. The dataset consists of 94,379 entries and 9 columns. Most of the values are non-null, but some are missing; ID, Travel_Class, and Travel_Distance have no null values.
  2. The data types of the columns are int, float, and object.

We see that:

  1. There are also 94,379 entries, as in the travel dataset. Most of the values are non-null; the ID, Overall_Experience, and Seat_Class columns have no null values.
  2. Most of the columns are of object type, except for ID and Overall_Experience, which are int.
  3. We can use the ID column to merge the travel and survey train sets into one dataset and proceed with exploration and modeling.

Let's merge both sets into one dataset
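A merge on the shared ID column could be sketched like this; in the notebook the frames come from `pd.read_csv("Traveldata_train.csv")` and `pd.read_csv("Surveydata_train.csv")`, while the tiny stand-in frames below just illustrate the same operation:

```python
import pandas as pd

# Small stand-ins for the travel and survey frames (illustrative values)
travel = pd.DataFrame({"ID": [1, 2, 3],
                       "Travel_Class": ["Eco", "Business", "Eco"],
                       "Travel_Distance": [532, 1425, 987]})
survey = pd.DataFrame({"ID": [1, 2, 3],
                       "Seat_Class": ["Green Car", "Ordinary", "Ordinary"],
                       "Overall_Experience": [1, 0, 1]})

# Inner merge on the shared ID key; with identical ID sets this keeps
# every row from both tables, so no data is lost
df = travel.merge(survey, on="ID", how="inner")
```

Checking `len(df)` against the two inputs afterwards is a quick way to confirm nothing was dropped.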

We see that all columns were merged and no data was lost.

Let's continue with exploratory data analysis

Let's check the columns which have null values.
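The per-column null check can be sketched as follows (the small frame is an illustrative stand-in for the merged dataset):

```python
import pandas as pd
import numpy as np

# Stand-in frame with the kind of gaps seen in the merged data
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Arrival_Delay_in_Mins": [0.0, np.nan, 12.0, np.nan],
    "Seat_Comfort": ["Good", "Poor", None, "Excellent"],
})

# Count missing values per column, then keep only columns that have any
null_counts = df.isnull().sum()
cols_with_nulls = null_counts[null_counts > 0]
```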

We see that we imputed 300 values into Arrival_Delay_in_Mins.

Let's focus on imputing:

These columns have a substantial amount of missing values.

We see that in both cases, whether the person was satisfied with the trip or not, the most frequent value of Arrival_Time_Convenient is Good, so let's use this value to impute all the missing values.

We see that the most frequent value is Good, so we impute it.

Again, the most frequent value is Good, so we impute it.

We see that the most frequent value is Loyal Customer, so we impute it.

We see that the most frequent value is Business Travel, so we impute it.
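The mode imputation described above can be sketched like this; the exact column names (`Arrival_Time_Convenient`, `Customer_Type`, `Type_Travel`) are assumptions based on the values mentioned, and the small frame is illustrative:

```python
import pandas as pd
import numpy as np

# Stand-in frame; column names assumed from the survey data
df = pd.DataFrame({
    "Arrival_Time_Convenient": ["Good", np.nan, "Good", "Poor"],
    "Customer_Type": ["Loyal Customer", "Loyal Customer", np.nan, "Disloyal Customer"],
    "Type_Travel": [np.nan, "Business Travel", "Business Travel", "Personal Travel"],
})

# Fill each column's NaNs with that column's most frequent value (mode)
for col in ["Arrival_Time_Convenient", "Customer_Type", "Type_Travel"]:
    df[col] = df[col].fillna(df[col].mode()[0])
```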

For now we will not drop outliers, but we will keep them in mind

Let's convert all the categorical values to numerical ones, and then use a KNN imputer to fill the remaining NaN values.

We see that many of the categorical columns are ordinal, so let's use ordinal encoding for them.
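A manual ordinal encoding that preserves the rating order can be sketched like this; the exact rating scale below is an assumption about the survey's levels:

```python
import pandas as pd

# Assumed ordered rating scale for the ordinal survey columns
rating_order = ["Extremely Poor", "Poor", "Needs Improvement",
                "Acceptable", "Good", "Excellent"]
rating_map = {level: i for i, level in enumerate(rating_order)}

df = pd.DataFrame({"Seat_Comfort": ["Poor", "Good", "Excellent", "Acceptable"]})

# Map each ordered category to its rank so the ordering survives encoding
df["Seat_Comfort"] = df["Seat_Comfort"].map(rating_map)
```

Unlike one-hot encoding, this keeps "Excellent > Good > Poor" as a numeric relationship the model can exploit.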

Modeling

First of all, let's try an XGBoost Classifier on this problem.

We see that ada_dTree seems the most promising; let's try to optimize it.

Let's use Hyperopt-SKLearn optimization to find the best model and the best hyperparameters.

Let's try to get rid of the outliers and see if it helps the model generalise.

We see that there are possible outliers in Travel_Distance and Departure_Delay_in_Mins

It's hard to say what the unit of Travel_Distance is: the longest distance that can be traveled on a Shinkansen line is 713.7 km (Tokyo–Shin-Aomori, https://www.nippon.com/en/features/h00077/), while the median Travel_Distance in our features is 1,926. So let's try to get rid of outliers with the classic interquartile-range approach.
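The interquartile-range filter can be sketched as follows, using the classic Tukey fences at 1.5 × IQR (the values are illustrative):

```python
import pandas as pd

# Stand-in distances with one extreme value
s = pd.Series([300, 450, 520, 610, 700, 820, 950, 1100, 9000])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values inside the Tukey fences
filtered = s[(s >= lower) & (s <= upper)]
```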

We see that these are only ~2% of the data, so it's probably safe to get rid of them.

We see that the extreme Departure_Delay values make up ~13% of the dataset, so we probably shouldn't drop them, as departure delay may have a big impact on customer experience.

Let's check the performance of the best model found, XGBoost, after the outliers have been dropped.

We see that validation accuracy has changed only slightly.

Let's check another XGBoost model found during the hyperopt model search, as it has about the same validation accuracy as xgbmodel_1.

We see that validation accuracy slightly dropped.

Let's try to impute all the missing values in the initial dataset with a KNN imputer and see if it improves the model performance.
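KNN imputation can be sketched with scikit-learn's `KNNImputer`; the tiny matrix below stands in for the encoded numeric dataset:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Encoded numeric matrix with one missing entry
X = np.array([[1.0, 2.0, 4.0],
              [1.0, np.nan, 4.0],
              [5.0, 6.0, 8.0],
              [1.0, 2.0, 4.0]])

# Each NaN is replaced by the mean of that feature over the
# k nearest rows, measured on the observed features
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
```

Here the two nearest neighbors of row 1 are rows 0 and 3, so the missing value is filled with their mean.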

We see that the model performance didn't improve and even slightly dropped, so let's stick with the original approach and use xgbmodel2, the best-performing model, to identify the most important features.

But let's check its robustness on the test set first

We see that the model is quite robust, with a validation accuracy of 95.31% and a test accuracy of 94.51% on unseen data.

Final Recommendations:

  1. I recommend paying attention to improving Onboard Entertainment, Seat Comfort, and Ease of Online Booking, as these are the most important factors for customer satisfaction.
  2. Another recommendation is to continue improving customer loyalty programs, as customer loyalty is one of the most important factors.